Experiments on Sentence Boundary Detection

Authors

  • Mark Stevenson
  • Robert J. Gaizauskas
Abstract

This paper explores the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition systems. An experiment which determines the level of human performance for this task is described as well as a memory-based computational approach to the problem.

1 The Problem

This paper addresses the problem of identifying sentence boundaries in the transcriptions produced by automatic speech recognition (ASR) systems. This is unusual in the field of text processing which has generally dealt with well-punctuated text: some of the most commonly used texts in NLP are machine-readable versions of highly edited documents such as newspaper articles or novels. However, there are many types of text which are not so edited and the example which we concentrate on in this paper is the output from ASR systems. These differ from the sort of texts normally used in NLP in a number of ways; the text is generally in single case (usually upper), unpunctuated and may contain transcription errors.[1] Figure 1 compares a short text in the format which would be produced by an ASR system with a fully punctuated version which includes case information. For the remainder of this paper error-free texts such as newspaper articles or novels shall be referred to as "standard text" and the output from a speech recognition system as "ASR text".

There are many possible situations in which an NLP system may be required to process ASR text. The most obvious examples are NLP systems which take speech input (e.g. Moore et al. (1997)). Also, dictation software programs do not punctuate or capitalise their output but, if this information could be added to ASR text, the results would be far more usable. One of the most important pieces of inform...

[1] Speech recognition systems are often evaluated in terms of word error rate (WER), the percentage of tokens which are wrongly transcribed. For large vocabulary tasks and speaker-independent systems, WER varies between 7% and 50%, depending upon the quality of the recording being recognised. See, e.g., Cole (1996).

GOOD EVENING GIANNI VERSACE ONE OF THE WORLDS LEADING FASHION DESIGNERS HAS BEEN MURDERED IN MIAMI POLICE SAY IT WAS A PLANNED KILLING CARRIED OUT LIKE AN EXECUTION SCHOOLS INSPECTIONS ARE GOING TO BE TOUGHER TO FORCE BAD TEACHERS OUT AND THE FOUR THOUSAND COUPLES WHO SHARED THE QUEENS GOLDEN DAY

Good evening. Gianni Versace, one of the world's leading fashion designers, has been murdered in Miami. Police say it was a planned killing carried out like an execution. Schools inspections are going to be tougher to force bad teachers out. And the four thousand couples who shared the Queen's golden...
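The contrast between "standard text" and "ASR text" described above can be illustrated with a small sketch. It is not the authors' method: the function name `to_asr_text` and the rule that a token ending in '.', '!' or '?' closes a sentence are assumptions made for illustration, and the rule would mislabel abbreviations such as "e.g.". The sketch strips case and punctuation from a punctuated English text and records where the sentence boundaries were, which is the information a boundary detector would have to recover.

```python
import re


def to_asr_text(standard_text):
    """Convert punctuated text to ASR-style tokens (hypothetical helper).

    Returns (tokens, boundary_after), where tokens are upper-case words
    with punctuation removed and boundary_after[i] is True when a
    sentence ended immediately after tokens[i].
    """
    tokens, boundary_after = [], []
    for raw in standard_text.split():
        # Keep only ASCII letters and digits, so "world's" becomes "WORLDS".
        word = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
        if not word:
            continue  # the token was pure punctuation
        tokens.append(word)
        # Assumption: '.', '!' or '?' at the end of a token (ignoring any
        # closing quote or bracket) marks the end of a sentence.
        ends_sentence = raw.rstrip("\"')").endswith((".", "!", "?"))
        boundary_after.append(ends_sentence)
    return tokens, boundary_after


sample = ("Good evening. Gianni Versace, one of the world's leading "
          "fashion designers, has been murdered in Miami.")
tokens, boundary_after = to_asr_text(sample)
print(" ".join(tokens))
print([i for i, flag in enumerate(boundary_after) if flag])
```

The token list corresponds to what a recogniser would output (ignoring transcription errors), while the boundary flags serve as reference labels against which a human or automatic annotator could be scored.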

Similar resources

Improving Automatic Sentence Boundary Detection with Confusion Networks

We extend existing methods for automatic sentence boundary detection by leveraging multiple recognizer hypotheses in order to provide robustness to speech recognition errors. For each hypothesized word sequence, an HMM is used to estimate the posterior probability of a sentence boundary at each word boundary. The hypotheses are combined using confusion networks to determine the overall most lik...

Semantic Role Labeling of Persian Sentences Using a Memory-Based Learning Approach

Extracting semantic roles is one of the major steps in representing text meaning. It refers to finding the semantic relations between a predicate and syntactic constituents in a sentence. In this paper we present a semantic role labeling system for Persian, using a memory-based learning model and standard features. Our proposed system implements a two-phase architecture to first identify...

A Word Labeling Approach to Thai Sentence Boundary Detection and POS Tagging

Previous studies on Thai Sentence Boundary Detection (SBD) mostly assumed a sentence ends at a space and formulated the SBD task as a disambiguation problem, which classified a space either as an indicator for Sentence Boundary (SB) or non-Sentence Boundary (nSB). In this paper, we propose a word labelling approach which treats the space character as a normal word, and detects SB between any tw...

Community Detection using a New Node Scoring and Synchronous Label Updating of Boundary Nodes in Social Networks

Community structure is vital for discovering the important structures and potential properties of complex networks. In recent years, improving the quality of local community detection approaches has become a hot topic in the study of complex networks, owing to their linear time complexity and applicability to large-scale networks. However, there are many shortcomings in these methods such as in...

Viewing sentence boundary detection as collocation identification

The detection of abbreviations is an important step in the process of sentence boundary detection. We describe a flexible, language-independent and accurate method based on the idea that an abbreviation can be viewed as a collocation. As such, it can be identified by using methods for collocation detection such as the log likelihood ratio. Although the log likelihood ratio is known to show a goo...

Dependency structure analysis and sentence boundary detection in spontaneous Japanese

This paper addresses automatic detection of dependencies between Japanese phrasal units called bunsetsus, and sentence boundaries in a spontaneous speech corpus. In spontaneous speech, the biggest problem with dependency structure analysis is that sentence boundaries are ambiguous. In this paper, we propose two methods for improving the accuracy of sentence boundary detection in spontaneous Jap...

Publication date: 2000